tf-idf 0
77b5aaf2826c95c98e5eb4ab830073de-Supplemental-Conference.pdf
Asystem of regions (also referred to as anetwork) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. This region uniquely encodespitch, speech, and music, butisnot involvedinhigh-levellanguage comprehensionandproduction[Norman-Haignereetal.,2015,2019].Inourexperimentspertaining to programming language comprehension, we use the activity seen in the auditory system as a negativecontrol. ForthePython program comprehension experiment, individual programs were modeled using the period from the onset of the code/sentence problem until the buttonpress. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network.
A Brain regions
A system of regions (also referred to as a network) can comprise multiple disjoint regions that exhibit shared activity patterns across a range of tasks. The auditory system is located in the superior temporal region of the brain. The voxels were then filtered using gray-matter masking and (for MD and the Language systems) network localization. See Fedorenko et al. [2010] for a discussion of the functional localization approach as it pertains to the language network. For each brain system and each code property or code model, we run a separate MVP A analysis.
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts
This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)
AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
Demirok, Basak, Kutlu, Mucahid
With the rapid advancement of LLM models, they have become widely useful in various fields. While these AI systems can be used for code generation, significantly simplifying and accelerating the tasks of developers, their use for students to do assignments has raised ethical questions in the field of education. In this context, determining the author of a particular code becomes important. In this study, we introduce AIGCodeSet, a dataset for AI-generated code detection tasks, specifically for the Python programming language. We obtain the problem descriptions and human-written codes from the CodeNet dataset. Using the problem descriptions, we generate AI-written codes with CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models in three approaches: i) generating code from the problem description alone, ii) generating code using the description along with human-written source code containing runtime errors, and iii) generating code using the problem description and human-written code that resulted in wrong answers. Lastly, we conducted a post-processing step to eliminate LLM output irrelevant to code snippets. Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets. We share our code with the research community to support studies on this important topic and provide performance results for baseline AI-generated code detection methods.
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Qatar (0.04)
- Education (0.48)
- Information Technology (0.47)
MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations
Retyk, Federico, Gasco, Luis, Carrino, Casimiro Pio, Deniz, Daniel, Zbib, Rabih
We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existent human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Spain (0.04)
- (26 more...)
MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness
Goswami, Dhiman, Puspo, Sadiya Sayara Chowdhury, Raihan, Md Nishat, Emran, Al Nahian Bin, Ganguly, Amrita, Zampieri, Marcos
This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.
- North America > United States (0.06)
- Asia > Middle East > Jordan (0.04)
- Africa (0.04)
Evaluating Embeddings for One-Shot Classification of Doctor-AI Consultations
Ojo, Olumide Ebenezer, Adebanji, Olaronke Oluwayemisi, Gelbukh, Alexander, Calvo, Hiram, Feldman, Anna
Effective communication between healthcare providers and patients is crucial to providing high-quality patient care. In this work, we investigate how Doctor-written and AI-generated texts in healthcare consultations can be classified using state-of-the-art embeddings and one-shot classification systems. By analyzing embeddings such as bag-of-words, character n-grams, Word2Vec, GloVe, fastText, and GPT2 embeddings, we examine how well our one-shot classification systems capture semantic information within medical consultations. Results show that the embeddings are capable of capturing semantic features from text in a reliable and adaptable manner. Overall, Word2Vec, GloVe and Character n-grams embeddings performed well, indicating their suitability for modeling targeted to this task. GPT2 embedding also shows notable performance, indicating its suitability for models tailored to this task as well. Our machine learning architectures significantly improved the quality of health conversations when training data are scarce, improving communication between patients and healthcare providers.
- North America > Mexico > Mexico City > Mexico City (0.05)
- South America (0.04)
- North America > United States (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Health Care Technology (0.68)
Disaster Tweets Classification using BERT-Based Language Model
Social networking services have became an important communication channel in time of emergency. The aim of this study is to create a machine learning language model that is able to investigate if a person or area was in danger or not. The ubiquitousness of smartphones enables people to announce an emergency they are observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies). Design a language model that is able to understand and acknowledge when a disaster is happening based on the social network posts will become more and more necessary over time.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Vietnam > Hanoi > Hanoi (0.05)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Asia > Vietnam > Kon Tum Province > Kon Tum (0.04)
- Information Technology > Services (0.36)
- Media > News (0.34)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.99)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Hybrid consistency and plausibility verification of product data according to FIC
The labelling of food products in the EU is regulated by the Food Information of Customers (FIC). Companies are required to provide the corresponding information regarding nutrients and allergens among others. With the rise of e-commerce more and more food products are sold online. There are often errors in the online product descriptions regarding the FIC-relevant information due to low data quality in the vendors' product data base. In this paper we propose a hybrid approach of both rule-based and machine learning to verify nutrient declaration and allergen labelling according to FIC requirements. Special focus is given to the problem of false negatives in allergen prediction since this poses a significant health risk to customers. Results show that a neural net trained on a subset of the ingredients of a product is capable of predicting the allergens contained with a high reliability.
- Oceania > New Zealand > North Island > Waikato (0.04)
- Europe > Germany (0.04)
- Health & Medicine > Consumer Health (1.00)
- Education > Health & Safety > School Nutrition (1.00)
- Materials > Chemicals (0.68)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.68)
How to detect novelty in textual data streams? A comparative study of existing methods
Christophe, Clément, Velcin, Julien, Cugliari, Jairo, Suignard, Philippe, Boumghar, Manel
Since datasets with annotation for novelty at the document and/or word level are not easily available, we present a simulation framework that allows us to create different textual datasets in which we control the way novelty occurs. We also present a benchmark of existing methods for novelty detection in textual data streams. We define a few tasks to solve and compare several state-of-the-art methods. The simulation framework allows us to evaluate their performances according to a set of limited scenarios and test their sensitivity to some parameters.